Text File | 1991-06-13 | 40KB | 511 lines

DNAid+ Manual

DNAid+
A DNA-oriented sequence editor for the Mac

© Frédéric Dardel & Pierre Bensoussan
Laboratoire de Biochimie, Ecole Polytechnique
91128 Palaiseau cedex - France
Tel : (1) 69 33 48 83   Fax : (1) 69 33 30 13

DNAid+ is freeware. Feel free to make copies and give them to interested colleagues, as long as the copyright notice is preserved.

This manual refers to version 1.4 of DNAid+. Since it is a preliminary version, it may still contain some undetected bugs. Most of the destructive ones (i.e. bomb-generating) have been eradicated, but some may still be lurking in rarely used portions of the program. If you encounter one of these nasty creatures, please send me a bug report (or any comments or suggestions) at the aforementioned address. Indicate precisely your system configuration and the relevant conditions of its occurrence (number of open windows, size of the sequence(s), action which resulted in the bomb…). This may help to improve future versions.

Frédéric Dardel

Introduction: What can DNAid+ do?

DNAid+ is a DNA-oriented full-screen editor. It can handle several different sequences, which are displayed in different windows. The windows containing sequences have lines of text and lines of numbering interleaved. In these windows, you can type, cut, paste, delete and modify portions of sequences. The theoretical maximum length of a sequence is 1,638,399 nucleotides or amino acids. However, the memory of your Mac will probably be saturated before that, and for very large sequences some functions become extremely time-consuming. DNAid+ will tell you when memory becomes overcrowded.

DNAid+ also has a special window labeled «ENZYMES», which holds the list of restriction endonucleases currently in memory. It is possible to add or remove restriction enzymes while using the editor and to save the changes to disk. In this window, you can select the set of enzymes with which the program will perform searches or restriction maps.
Basically, DNAid+ is a multi-window text editor with standard editing and printing functions, but it also has built-in capabilities to suit the specific needs of molecular biologists. As in a word processor, you can cut and paste fragments of DNA sequence to mimic a cloning experiment. You can also reverse-complement all or part of your sequence, translate it into amino acids, search for restriction sites, make restriction maps, or calculate the lengths of the DNA fragments produced by a multiple endonuclease digestion. DNAid+ also has powerful search functions, which can help you identify open reading frames, regulatory signals, consensus sequences, potential oligonucleotide annealing sites and any complex or degenerate patterns of nucleotides or amino acids you specify. Frequently used patterns can be saved to disk and loaded automatically upon startup, to avoid typing them each time.

This software is a standard Macintosh application and is interfaced to the computer environment. This means that you can import or export sequences as «Text only» files or through the clipboard. DNAid+ runs on all Macintoshes with at least 512 Kbytes of RAM and the 128K ROMs, under version 4.1 of the system and all subsequent versions (including System 7.0).

Sequence Windows

Editing data in sequence windows is straightforward: if you have used MacWrite or a similar word processor, you will find it very easy. Click at the position where you want to insert or modify data, then type the desired nucleotides or residues (in uppercase). The numbering lines are not part of the sequence. Think of them as a background on which your sequence is superimposed.

Enzyme Window

This window contains the list of restriction enzymes currently known to the program. Click on the desired enzymes to select them for restriction mapping purposes. The selected enzymes appear in black in the window, even when the enzyme window is not active.
• To Edit a given restriction enzyme or check its characteristics, double-click on its name in the enzyme window. A dialog box will then appear, allowing you to modify its name, cut site and cut position. Restriction endonucleases with a non-palindromic recognition site must be entered twice, once with the site on each strand.
• To Delete a restriction enzyme, double-click it and delete its name in the dialog box (using the backspace key). An enzyme with an empty name is automatically removed from the list.
• To Add a new enzyme, double-click in the blank region of the window, after the last restriction enzyme. DNAid+ will then display an empty enzyme dialog box. Fill it in with the new enzyme name and characteristics and click OK. The new list is sorted alphabetically before display.

Don't forget to save the enzyme list if you want to keep the changes the next time you start DNAid+.

Characters

The standard IUPAC/IUB nucleotide codes are used by DNAid+ to store DNA sequences (see appendix). In addition to the standard A, T, G and C, these codes allow ambiguous nucleotides to be entered at given positions in a sequence or in a restriction enzyme recognition site. DNAid+ can directly perform restriction mapping on sequences containing ambiguous nucleotides (such as the reverse translation of a protein into DNA), thus allowing you to predict all the restriction sites that could possibly appear in such a sequence. Blank spaces can also be introduced in DNA files to indicate regions of unknown sequence.

The standard one-letter code for amino acids is used to display protein sequences, together with 'X' (any amino acid), the blank space (unknown amino acid) and the underscore character ( _ ), which is produced by DNAid+ when translation of a nonsense codon is attempted.

Please note that DNAid+ distinguishes between uppercase and lowercase characters; all nucleotide and amino acid codes should therefore be typed in uppercase.
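To illustrate how restriction sites can be located in a sequence containing ambiguous codes, here is a minimal Python sketch. It is not DNAid+'s actual implementation: the table is the standard IUPAC nucleotide code, the function names could_match and possible_sites are hypothetical, and treating a blank (unknown region) as matching nothing is an assumption. A site is reported wherever every position of the recognition site shares at least one possible base with the sequence.

```python
# Standard IUPAC nucleotide codes mapped to the bases each one can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def could_match(seq_code, site_code):
    """True if the two codes share at least one possible base.

    Assumption: any character outside the table (e.g. a blank marking an
    unknown region) matches nothing.
    """
    return bool(set(IUPAC.get(seq_code, "")) & set(IUPAC.get(site_code, "")))

def possible_sites(sequence, site):
    """Return the 1-based start positions where `site` could occur."""
    width = len(site)
    return [
        start + 1
        for start in range(len(sequence) - width + 1)
        if all(could_match(s, r)
               for s, r in zip(sequence[start:start + width], site))
    ]

# A BamHI site (GGATCC) could be created where the sequence reads GGNTCC.
print(possible_sites("AAGGNTCCAA", "GGATCC"))  # [3]
```

Because the test is "at least one base in common", a hit on an ambiguous sequence means the site could exist there, not that it necessarily does.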
Files

DNAid+ uses several types of files: sequence files, enzyme files and pattern files.

Sequence files can be of three types: DNA, protein or «Text only». The first two types are private to DNAid+, i.e. they cannot be used by other applications. Text files are standard Macintosh files which can be read by a variety of applications such as word processors, communications programs… DNAid+ can open and save sequences as text files, thus enabling data exchange with other applications.

Enzyme files and pattern files are used by DNAid+ to store information about restriction endonucleases and standard patterns. Upon startup, DNAid+ automatically looks for its default files, named EnzymeList and PatternList. On HFS volumes, these files should be stored in the same folder as DNAid+.

File Menu

New Sequence

This command creates an empty sequence window labeled «Sequence - n» (where n is a number starting from zero) and makes it the active window. The default sequence type associated with this new window is linear DNA. To alter the sequence type, use the Get Info command in the same menu (see below).

Open Sequence

This command brings up a standard dialog box allowing you to select the sequence file to open in the appropriate disk or folder. A new window bearing the name of your file is opened on the screen and made the active window. The blinking caret is positioned at the beginning of the sequence, ready for editing. The type of the opened sequence defaults to that of the chosen file.

Save Sequence as

When you invoke this command, the standard save dialog box is displayed, allowing you to select the disk and/or folder where your sequence file will be stored. The file is saved with the type of the sequence, as specified using the Get Info command (DNA, protein or other). «Other» type sequences are saved as standard text files, with a carriage return inserted every 60 nucleotides or amino acids.
Such files can be read by word processors such as MacWrite or communication programs such as Versaterm. Conversely, MacWrite files saved with the «Text only» option can be loaded directly into DNAid+. In this case, care should be taken to remove invisible characters such as carriage returns or tabs, either by answering OK to the dialog that is automatically called upon loading text files or by using the Purge Illegal Characters command in the Goodies Menu.

It is good practice to save your sequence data as often as possible.

Close Sequence

This command closes the active (topmost) sequence window, releasing the memory it used. It is equivalent to clicking in the close box at the top left of the sequence window. If the content of the window has been modified, you will be asked whether you want to save the changes before closing. You should close sequence windows when you are through with them, to avoid crowding the screen with windows and saturating memory with unnecessary data.

Print Sequence

This command produces a printed listing of your data. You will be presented with the two standard dialog boxes in order to choose the print parameters. Depending on the type of the sequence, different results will be obtained:

• If the sequence is of Protein or Text type, it will be printed together with the numbering, as in the sample below (actual print lines are 90 characters long):

    MEKKYWSCHWKRLVCSAGGASKMSKSSHIGH
             :         :        30

• If the sequence is a DNA sequence, you will be presented with a dialog box offering several printing options: 1, 3 or 6 phase translations are available.

1 Phase: If you select this option, you must specify which of the three phases you want translated by clicking on the appropriate button (phase 1 begins on nucleotide #1, phase 2 on #2…). You can have the translation start at a desired position (in most cases, the beginning of a coding phase) by specifying its coordinate in the «Start translation at :» box.
If you click the «Translate 1 ORF only» box, the translation will be printed only up to the first stop codon encountered in that frame.

3 Phases: A printout is generated with the direct DNA sequence, numbering and translation in the 3 direct reading frames.

6 Phases: DNAid+ generates a printout of the active sequence with both DNA strands, numbering and translation in the six possible reading frames (3 direct + 3 reverse), as shown in the example below (actual print lines can be of any desired length):

    GATCTACCGAGGTATACAACGGGCTATGCCT
             :         :        30
    CTAGATGGCTCCATATGTTGCCCGATACGGA
    AspLeuProArgTyrThrThrGlyTyrAla
    IleTyrArgGlyIleGlnArgAlaMet
    SerThrGluValTyrAsnGlyLeuCys
    Ile___ArgProIleCysArgAlaIleSer
    AspValSerThrTyrLeuProSerHis
    ArgGlyLeuTyrValValPro___Ala

The «Bases per Line» box allows you to select the width of the printout. The default is 90, but it can be increased for wide or reduced printing. If the «Print to file» check box is ticked, the output is sent to a text file instead of the printer, allowing further editing with a word processor. If you want to print the DNA sequence only, you can do so by temporarily changing the sequence type to Text with the Get Info command. Printing is supported on both ImageWriter and LaserWriter printers.

Open Enzyme File

This command allows you to open an alternate restriction enzyme file, previously saved with the Save Enzyme File command described below. It brings up a standard dialog box asking you to select the file you want to load. Only files of the appropriate type are listed in the dialog. You can have as many different enzyme files as you want on your DNAid+ disk. For instance, you may have one file for commercial enzymes, one file for enzymes currently in stock in your lab… Only one is active in the enzyme window at a time, but you can change it as often as needed. Upon startup, DNAid+ will look for a default Enzyme file to load.
This file should be called EnzymeList and be located in the same folder as DNAid+. You may modify the EnzymeList file supplied with DNAid+, add or remove enzymes and so on. The modified file will be loaded automatically thereafter.

Save Enzyme File

This command is used to save the current characteristics of the restriction endonucleases listed in the enzyme window. You can use it to create alternate files or to modify the default file named EnzymeList, which is automatically loaded by DNAid+ upon startup.

Get Info

This command displays a dialog box which indicates the name, length and type of the active sequence. The type of the sequence determines the file type under which it will be saved by the Save Sequence As command and also affects the commands available in the menus (you don't make restriction maps on protein sequences!). The check box labeled circular affects the way the program calculates restriction maps on this particular DNA sequence. If it is checked, the sequence is considered a circular DNA molecule and the fragment sizes are calculated accordingly. The program will also search for restriction sites overlapping the beginning and the end of the sequence. The circular/linear nature of a particular DNA is saved together with the sequence on file, so that DNAid+ remembers it the next time you open it.

Quit

Use this command to leave DNAid+ and return to the Finder. If some sequences loaded into the editor have been modified, you will be asked whether you want to save them before leaving.

Edit Menu

Undo

This command is not implemented in this version. It is nevertheless present in this menu to ensure compatibility with some desk accessories that may use it.

Cut

The Cut command copies the selected portion of the sequence (which appears in black) to the clipboard and deletes it from its current position. It can then be pasted into another part of the same sequence, another sequence window or another application.
(When you leave DNAid+, the clipboard is preserved.) The Cut command may also be used by some desk accessories.

Copy

The Copy command is similar to the Cut command, except that the selected sequence is not deleted from its original location when it is placed in the clipboard. Like Cut, Copy is also available for desk accessories.

Paste

Use this command to paste a sequence that is stored in the clipboard. This sequence may have been previously cut or copied in a sequence window or in another application.

Clear

This command deletes the selected sequence without transferring it to the clipboard. Be careful when using it, since you won't be able to recover the cleared sequence afterwards. If no sequence is selected, nothing happens. Using this command is equivalent to pressing the backspace/delete key while a portion of the sequence is selected.

Select All

This command selects the whole content of the active sequence window, without your having to drag through it. You can then cut or copy the entire sequence.

Filter Key

This command enables or disables a routine that filters the characters typed by the user. It functions as a switch which turns the filter on and off. When filtering is enabled, a check mark appears in the menu before the command. Filtering depends on the type of the sequence, as defined by Get Info. For a DNA type sequence, only uppercase characters corresponding to nucleotides (A, T, G, C) or combinations of nucleotides (N, R, Y, M, W, S, K…) are allowed. Also allowed is the blank character (space), which can be used for unknown nucleotides. For protein sequences, all standard one-letter codes are accepted, plus X, which represents any residue, and the underscore character ( _ ), which is produced by DNAid+ as the "translation" of a stop codon. For other sequences (Text), any alphabetic character is accepted. When the filter is disabled, no check is performed. Upon startup, DNAid+ sets the Filter Key switch to on.
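The filtering rule described above can be sketched in Python as follows. This is only an illustration, not the program's own code: the function name accepts is hypothetical, the DNA set assumes the full IUPAC ambiguity alphabet (the manual's list ends with an ellipsis), and accepting a blank in protein sequences is an assumption based on the Characters section.

```python
# Assumed character sets; the manual lists only some of the ambiguity codes.
DNA_CHARS = set("ATGC") | set("NRYMWSKBDHV") | {" "}
PROTEIN_CHARS = set("ACDEFGHIKLMNPQRSTVWY") | {"X", "_", " "}

def accepts(char, seq_type, filter_on=True):
    """Decide whether a typed character is let through by the filter."""
    if not filter_on:
        return True                    # filter disabled: no check at all
    if seq_type == "DNA":
        return char in DNA_CHARS       # uppercase nucleotide codes and blank
    if seq_type == "Protein":
        return char in PROTEIN_CHARS   # one-letter codes, X, _ and blank
    return char.isalpha()              # Text: any alphabetic character

typed = "ATG cxN1"
print("".join(c for c in typed if accepts(c, "DNA")))  # ATG N
```

Lowercase letters are rejected for DNA and protein types because the sets contain uppercase codes only, which matches the manual's advice to type in uppercase.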
Goodies Menu

DNA -> Protein

This command determines the amino acid translation of a given DNA sequence. A new sequence window is created and the protein sequence deduced from the DNA sequence of the active window is displayed in it (in one-letter code). If a portion of the sequence is selected, you will be asked whether the translation is to be performed on the whole sequence or only on the selected range. This latter option is particularly useful in combination with the pattern matcher, which can automatically search for and select open reading frames in your sequence. DNAid+ automatically sets the type of the translated sequence to protein, as you can check by looking at the info window (Get Info command). If there is no selected range, or if you choose to translate the whole sequence, the translation starts from the first nucleotide of the input DNA sequence. Stop codons are shown as underscore characters ( _ ). If you try something horrible such as translating a protein sequence into protein, DNAid+ will bring up an alert box asking for confirmation of such a heretical manipulation.

Protein -> DNA (Reverse translatase!)

With this command, you can "reverse translate" the protein sequence of the active window into a DNA sequence containing ambiguous nucleotide codes. For example, amino acids such as glycine or lysine are translated into GGN and AAR respectively (N is A/T/G/C and R is A/G). A new window of DNA type is created to display the new sequence. The amino acids which can be encoded by six codons are a problem, since it is not possible to give them a unique translation. For instance, arginine is encoded by the CGN and AGR codons. DNAid+ assigns MGN (M is A/C) as the back translation of arginine, since it is the smallest set of codons composed of ambiguous nucleotides that encompasses the six arginine codons. However, MGN also encompasses AGT and AGC, which are not arginine codons.
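As a rough illustration of reverse translation, the following Python sketch maps each residue to a single ambiguous codon. Only the codes for Gly (GGN), Lys (AAR) and Arg (MGN) come from this manual. The other entries follow from the standard genetic code, and the choices for the two other six-codon residues (YTN for Leu, WSN for Ser) are assumptions obtained by applying the same "smallest enclosing ambiguous codon" rule; the table and function names are hypothetical.

```python
# One ambiguous codon per residue (standard genetic code, IUPAC codes).
BACK_TABLE = {
    "A": "GCN", "C": "TGY", "D": "GAY", "E": "GAR", "F": "TTY",
    "G": "GGN",              # from the manual
    "H": "CAY", "I": "ATH",
    "K": "AAR",              # from the manual
    "L": "YTN",              # assumption: also covers TTT/TTC (Phe)
    "M": "ATG", "N": "AAY", "P": "CCN", "Q": "CAR",
    "R": "MGN",              # from the manual: also covers AGT/AGC
    "S": "WSN",              # assumption: covers many non-serine codons
    "T": "ACN", "V": "GTN", "W": "TGG", "Y": "TAY",
    "X": "NNN",              # any residue
}

def reverse_translate(protein):
    """Return an ambiguous DNA sequence that could encode `protein`.

    Raises KeyError for characters missing from the table (for example the
    stop symbol "_" or a blank), which this sketch does not handle.
    """
    return "".join(BACK_TABLE[residue] for residue in protein)

print(reverse_translate("GKR"))  # GGNAARMGN
```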
Reverse translation is particularly useful in combination with the Restriction Mapping function. DNAid+ can search for restriction sites on such degenerate DNA sequences. The sites found indicate the positions at which enzyme recognition sites could be introduced by mutagenesis or DNA synthesis without affecting the amino acid translation. Attempts to reverse translate a DNA sequence into DNA will result in an alert box asking for confirmation of this bizarre manipulation.

Complementary Strand

This command takes all or part of the active DNA sequence and transforms it in situ into its complementary sequence. This corresponds to two successive operations: inversion of the sequence and replacement of each nucleotide by its complementary nucleotide (including ambiguous codes). This operation is done within the active window (i.e. no other window is created); your sequence is therefore modified by this manipulation. If a portion of your DNA sequence is selected, DNAid+ will ask whether the complementation is to be performed on the entire sequence or on the selected range only. This latter option is particularly useful for inverting a restriction fragment within a given sequence, with no need to cut and paste in an ancillary window. If no portion of the active window is selected (i.e. the cursor is a blinking caret), DNAid+ performs the complementation on the whole sequence. If you try to complement a protein sequence, you will get an alert box.

Purge Illegal Characters

When you import a sequence from another application (word processor, communication program…), some invisible control characters may be embedded within your sequence (tabs, carriage returns, linefeeds…). You can detect this because the sequence line ends will not align properly with the numbers. Purge Illegal Characters can be used in this case to exterminate all "unorthodox" nucleotides or amino acids.
Lines starting with a semicolon (“;”) are considered comments and are automatically removed (such comment lines are found at the beginning of database entries). This command uses the same filter as the Filter Key switch (Edit Menu). If your sequence is of the DNA type, all non-nucleotide characters are eliminated. In protein sequences, all non-amino-acid characters are suppressed. The Text (Other) type allows all alphabetic characters. Check that the type flag is correctly set in the Get Info dialog box before using this command.

Warning: Lowercase characters are not recognized as valid amino acids or nucleotides. Purge Illegal Characters therefore converts lowercase to uppercase.

Molecular Weight

This command calculates the molecular weight of a protein or peptide. If you select a region of your sequence, such as a possible tryptic peptide, and then activate the Molecular Weight command, DNAid+ calculates the molecular weight of the selected sequence and displays it in a small dialog box. If no sequence is selected, the molecular weight of the whole sequence is computed. DNAid+ calculates molecular weights using either the average mass or the monoisotopic mass of amino acids, the latter being the molecular mass obtained by adding the atomic masses of the most abundant isotopes of C, N, O, H and S. The monoisotopic molecular weight can be used to interpret mass spectrometry analyses of peptides.

HCA Plot

This function is a preliminary implementation of the "Hydrophobic Cluster Analysis" of protein sequences described by Gaboriaud et al. (FEBS Lett. 224: 149-155, 1987), and follows the notations defined by these authors. It can be used to visualize regions with a propensity to fold into α-helices or extended conformations, or to find regions of conserved conformation visually. The program outputs the 2D representation of the sequence, the "HCA plot", to the printer (not to the screen).
The graphic operations involved are lengthy, so be patient. Immediately below each line of the HCA plot, a Garnier-Robson secondary structure prediction is drawn to the same scale. Black boxes indicate α-helices and black lines indicate β-sheets predicted by the Garnier-Robson algorithm.

Restriction Menu

Select all Enzymes

This command selects all the enzymes of the enzyme window in a single manipulation. This is particularly useful when you want to make a complete restriction map of a DNA sequence. The enzyme window need not be active (i.e. highlighted) to use this command. It does not matter whether some enzymes were already selected.

Deselect all Enzymes

This is the reverse of the previous command. It can be used to deselect a set of enzymes in a single operation, without having to click explicitly on every selected enzyme. The enzyme window need not be active to use this command.

Smart Enzyme Selection

This submenu allows you to select a subset of the restriction enzymes according to various criteria. DNAid+ scans the list of selected enzymes (those that are highlighted in the enzyme window) and determines, for each of them, the number of sites it has in the selected sequence of the active window (if no sequence is selected, the search is performed on the entire content of the window). If the enzyme satisfies the selected criterion (no site, single site, at least one site), it is left highlighted; otherwise it is deselected. The typical use of this command is to find an enzyme which cuts in the vector sequence of a recombinant plasmid but not in the insert sequence, or vice versa. To do that, first select all enzymes using the first command of this menu, then select the insert in the window containing the plasmid sequence, using for instance GoTo Next Site and Extend Selection (see below). The no site sub-command will then deselect all enzymes which cut within the insert of your plasmid.
The next step is to click anywhere in the plasmid window to deselect the insert sequence. Using the at least one site sub-command will then further remove from your enzyme selection all those which don't cleave your plasmid, either in the insert or in the vector. The enzymes still selected after this double screening are those that cut in the vector but not in the insert. More elaborate selections can also be made using different sequence windows.

GoTo Next Site

Upon activation of this command, DNAid+ searches for the first restriction site of any of the enzymes selected in the enzyme window, starting from the cursor position in the active sequence. When a site is found, DNAid+ beeps, shows a box with the name of the enzyme whose site was found and its coordinate in the sequence, and highlights the site in the sequence window. You must then click OK or hit return, and the cursor is positioned at the cut site. You can use this command with a single enzyme selected to go to a given site. Alternatively, you can use it with one or several scores of enzymes selected to find which one among them cuts closest to a region of interest in your sequence.

When two selected enzymes have overlapping recognition sites, such as Sau3A (GATC) and BamHI (GGATCC), DNAid+ returns the enzyme which has the longest recognition site. In the BamHI/Sau3A example, the program will find a BamHI site. If the two enzymes have recognition sites of the same length, such as SalI sites which also happen to be AccI sites, the program returns the first enzyme in alphabetical order. Thus if both AccI and SalI are selected, the SalI sites will be masked by AccI. This restriction also holds for the Extend Selection command, but not for Text Map and Fragment Size.
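The tie-breaking rule just described (longest recognition site first, then alphabetical order) can be sketched in Python. This is one possible reading of the rule, not the program's own algorithm: the sketch assumes an unambiguous sequence, ignores cut positions, reports the start of the recognition site, and treats as rivals all sites that overlap the earliest one found. The function names site_hits and next_site are hypothetical.

```python
# IUPAC codes, so that recognition sites such as AccI (GTMKAC) can be used.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def site_hits(sequence, enzymes, start=0):
    """List (position, name, site length) for every site at or after `start`."""
    hits = []
    for name, site in enzymes.items():
        width = len(site)
        for pos in range(start, len(sequence) - width + 1):
            window = sequence[pos:pos + width]
            if all(base in IUPAC[code] for base, code in zip(window, site)):
                hits.append((pos, name, width))
    return hits

def next_site(sequence, enzymes, start=0):
    """Return (enzyme name, 1-based coordinate) of the next site, or None."""
    hits = site_hits(sequence, enzymes, start)
    if not hits:
        return None
    first = min(pos for pos, _, _ in hits)
    end = max(pos + width for pos, _, width in hits if pos == first)
    rivals = [hit for hit in hits if hit[0] < end]   # overlap the first site
    # Longest site wins; ties are broken by alphabetical order of the name.
    pos, name, _ = min(rivals, key=lambda hit: (-hit[2], hit[1]))
    return name, pos + 1

print(next_site("AAGGATCCAA", {"Sau3A": "GATC", "BamHI": "GGATCC"}))
print(next_site("TTGTCGACTT", {"SalI": "GTCGAC", "AccI": "GTMKAC"}))
```

With these inputs the first call reports BamHI rather than the Sau3A site nested inside it, and the second reports AccI rather than SalI.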
Extend Selection

This command is very similar to the GoTo Next Site command, except that instead of positioning the cursor at the cut position of the site found, it selects the entire region of the sequence between the previous cursor position and the cut site. Extend Selection and GoTo Next Site together are very handy for simulating cloning experiments and recombinant DNA constructions:

• First, select in the enzyme window the set of enzymes which will be used in the construction.
• Second, use the GoTo Next Site command repeatedly to go to the beginning of the fragment to clone.
• Then activate Extend Selection: your restriction fragment will be selected exactly, between the cut positions of the selected enzymes.
• Finally, use Cut or Copy to transfer it into the appropriate vector.

In the vector sequence, you can use the same procedure to select the DNA fragment which will be replaced by the insert (if any). Paste will automatically delete it before replacing it with your fragment.

Text Map

This command outputs a listing of all the cutting positions of the selected enzymes in the active sequence. The cut sites are displayed by enzyme, with the cut positions sorted by order of appearance. Enzymes which have no recognition site within the sequence are listed separately at the end. The listing can be directed either to the screen or to the printer. If you choose the screen display, DNAid+ stops after every scroll page, waiting for you to press the continue button. If the sequence is circular (to set this parameter, see the Get Info command in the File menu), DNAid+ will look for sites which overlap the origin of the sequence.

Fragment Size

Whenever you need to predict the restriction pattern of a given plasmid digested by a set of enzymes, use the Fragment Size command after selecting the appropriate enzymes in the enzyme window.
DNAid+ will then display the list of fragments generated in the digestion, either on the screen or on the printer, according to your choice. The left part of the listing displays the fragments in their order of appearance, while the right part shows them sorted by decreasing length, as they would appear on an electrophoresis gel. The numbers between parentheses allow you to cross-reference the two lists.

Search Menu

Pattern Matching

This command is one of the most complex and most powerful in DNAid+. It brings up a dialog box which enables you to define complex patterns in a special language, and then to search your sequence (DNA or protein) for occurrences of such patterns. Patterns can be loaded and saved with commands described below, thus allowing you to build up a library of useful patterns.

When you activate the Pattern Matching command, DNAid+ displays a dialog allowing you to add, delete or modify patterns, select the current pattern which will be used in searches, and then start the search. Every pattern is characterized by a name and a definition. It is possible to use the name of a pattern as part of another pattern definition; DNAid+ will replace it by its definition when the search is performed.

• To add a new pattern, click on the NewPattern button. This clears the definition box and puts the default name "NewPattern" in the name box. You must then type the definition of your pattern in the definition box. When your definition looks OK, change the pattern name to whatever you like and click the Valid button to have DNAid+ check its syntax. This makes the newly defined name appear in the scroll box. If you omit clicking the Valid button, the pattern will not be recorded by the program. Hitting the return or enter key has the same effect as clicking the Valid button.
• To modify a given pattern, you must select it in the scroll box by clicking on it.
This will highlight its name and bring its definition and name into the corresponding boxes. You can then edit both the name and the definition of the pattern. To make the modification effective and have DNAid+ check the syntax of the new definition, click the Valid button. If you omit clicking it, nothing will be changed.
• To delete a pattern, select it in the scroll box, then click the Delete Pattern button.
• To search for a given pattern, select it in the scroll box by clicking on it, then click the Search button. DNAid+ searches the topmost sequence window, starting from the current cursor position. The first time a pattern is searched, a built-in compiler translates your pattern into a special structure called a "deterministic finite-state automaton", which can be viewed as a search machine specifically designed for your pattern. This compilation can take some time, but is done once and for all. All subsequent searches with the same pattern will be very efficient.

If the pattern is found in the sequence, the program closes the Pattern Matching dialog box and selects the occurrence found in the sequence. You can then repeat the search using the Search Current Pattern command described below. If the search fails, the dialog box is not closed, so that you can modify the pattern and perform another search.

Rules for Pattern Definition:

• Any simple string of nucleotides or amino acids (non-ambiguous codes only) with no blanks is legal:
    TATAATA
    KMSKS
Blanks are considered separators by the program; TAT AATA is therefore understood as string TAT followed by string AATA. In most cases this is the same as TATAATA, but in some complex pattern definitions the difference may be relevant. It is therefore good practice to avoid unnecessary blanks.
• When you want to specify an ambiguity, enclose all the possibilities between square brackets:
    [AG]     specifies a purine (i.e. A or G)
    [FYW]    specifies an aromatic amino acid (i.e. Phe, Tyr or Trp)
• Alternatively, you can put a minus sign "-" in front of a single character between square brackets to specify anything but that amino acid or nucleotide:
    [-T]     anything but a T (equivalent to [AGC] for a nucleotide sequence)
    [-K]     anything but a lysine
• The point character "." can be used within strings to stand for any nucleotide or amino acid. It is equivalent to N for nucleotides or X for amino acids:
    GG.      defines a Gly codon (GG followed by any nucleotide)
    K.K      stands for two lysines separated by any one residue
• Complex patterns can be formed by combining the above conventions:
    [AG]TG                    defines a start codon (ATG or GTG)
    [ST][KR][ST]              defines one basic residue (Lys or Arg) surrounded by two hydroxyl-containing amino acids (Ser or Thr)
    T[-A]T.TT[AGT]..TG[-A]    is also a valid pattern
• Patterns can be combined in two ways: Concatenation or Alternative. In the following lines, expressions of the form <pattern> stand for any legal pattern definition.
  - Concatenation: <Pattern1> <Pattern2> (separated by a blank character)
        GG. GG.             defines two adjacent glycine codons
  - Alternative: <Pattern1> or <Pattern2>
        TGA or (TA[AG])     defines a stop codon (TGA or TAA or TAG)
  - Concatenation and Alternative can be combined with parentheses:
        T(GA or A[AG])      can alternatively be used to define a stop codon (T followed by either GA or by AA or AG)
• Be careful when using the or operator. It operates only on the preceding simple string or pattern between parentheses: TAT AATA or TGA is interpreted as TAT (AATA or TGA) and is therefore different from TATAATA or TGA. Whenever you are in doubt about the meaning of a definition, use parentheses, as in the following example:
    (T[AT]AATA) or TGA
• You can specify repeats of a given pattern with special constructions known as multipliers.
The general form of a multiplier is:
    <pattern> {x,y}
  This is interpreted by the program as «At least x and at most y repeats of the pattern defined by <pattern>».
    T{7,12} defines a stretch of 7 to 12 consecutive Ts
    [-S]{10,15} defines a stretch of 10 to 15 residues without serine
    (T(GA or A[AG])){2,4} defines a pattern of 2 to 4 consecutive stop codons

• There are legal simplifications of the multipliers:
    <pattern> {x} stands for exactly x repeats of the pattern defined by <pattern>.
    <pattern> {x,} stands for at least x repeats of the pattern (no upper limit).
    <pattern> * stands for any number of repeats of the pattern defined by <pattern>.
  Examples:
    ATG ...* T(GA or A[AG]) will search for open reading frames, i.e. an ATG, followed by any number of triplets (...*), followed by a stop codon.
    T{12,} defines a stretch of at least 12 Ts (up to any number)
    CG{5} defines a stretch of exactly five consecutive CG (this is equivalent to CGCGCGCGCG)

• Multiplier structures take a lot of memory and take longer to compile. In particular, avoid {x,y} multipliers with large values of x and y, or where y is much larger than x. Whenever possible, replace them by {x,} or * structures, which consume much less memory.

• You can use a previously defined pattern name in place of its definition. For instance, if you defined
    start    as  ATG
    stop     as  T(GA or A[AG])
    triplet  as  ...
  then you can define
    orf      as  start triplet* stop
  This pattern is equivalent to the open reading frame pattern shown above, but is easier to type and much more readable (one start, any number of triplets, one stop). Pattern names should contain only alphabetical characters or the underscore character "_" and should always be typed in lowercase, so that they cannot be mistaken for nucleotides or amino acids, which are coded by uppercase characters.

Sample Patterns

definition                                  name (comment)
[AG]TG                                      start_codon
T(GA or A[AG])                              stop_codon
[-T].. or T([TC].
or (G[-A] or A[CT]))                        coding_codon (i.e. non-stop)
[AG]TG coding_codon{20,} T(GA or A[AG])     orf (>20 residues)
CG([-C] or C*[AT])* CG                      noCG (stretch with no CG)
T[AG]. [AG]{2} T                            pribnow_box
[CH].{1,2}[CH] .* [CH].{1,2}[CH]            zinc_finger (zinc binding consensus)
G.GK[ST]                                    atp_site (ATP binding consensus)

Search Current Pattern

You can use this command to repeat a search, either in the same sequence window or in another sequence window, using the last pattern selected with the Pattern Matching command. If no pattern is selected, you will get an alert box.

Align

When this command is activated, DNAid+ takes the sequences in the two frontmost windows and tries to align them according to their homologous regions. This routine was mostly designed to align very homologous sequences, for instance DNA sequence gel readings of overlapping clones, which are expected to be identical except in poorly resolved regions of the gels. By tuning its parameters with the following command, Align Parameters, it may be possible to align less homologous DNA sequences, and protein sequences.

Align works fine with sequences a few hundred nucleotides long. Attempting to align longer sequences dramatically increases calculation time and will most probably cause a memory overflow, unless very stringent parameters have been selected (see next command).

Align first calculates the homology, then asks you where the alignment should be displayed: screen or printer. It then outputs the two aligned sequences, introducing gaps where necessary, and numbering the sequences at the end of lines.

Align Parameters

This command can be used to modify the parameters of the Align routines. The program searches for homology vectors, that is, stretches of consecutive nucleotides or residues that are identical in the two sequences (see Dardel, CABIOS (1985) 1, 173-175). These vectors are taken into account by the program only if their length exceeds a threshold value: the minimal match length.
This parameter is originally set to 4, which is a good value for short DNA sequences. Increasing its value increases the stringency of the alignment but also reduces the amount of storage required. When a memory full alert occurs, you should try increasing this value in order to align the sequences.

The second parameter is the maximal gap size, that is, the maximal number of nucleotides that the program will insert or delete at a given position to continue the alignment. This parameter does not modify the amount of storage required or the calculation time, therefore you can change it freely. However, very large values (>30) are likely to give nonsense results (the best alignment will be missed).

Load Pattern Set

This command allows you to load a previously defined set of patterns into memory. The patterns already defined in memory are not destroyed; instead, the new patterns are added to the list. Whenever a name conflict occurs, that is, when the loaded file contains a pattern with the same name as a pattern currently in memory, the pattern on disk is not loaded.

You can have as many pattern files as you like, but since patterns take a lot of memory, DNAid+ only allows you to have a dozen of them defined in memory. This is because each pattern has an associated graph-like structure, called a deterministic finite-state automaton, which allows very fast searches but takes a large amount of storage. You should delete unused patterns to free memory when a memory full error occurs.

Save Pattern Set

This command allows you to save the list of currently defined patterns to disk. For each pattern, both name and definition are stored, but not the associated finite-state automaton, which is recalculated from the definition upon loading with Load Pattern Set. This saves disk space.

Find With Errors

This command can be used to perform simple searches that do not require the pattern matcher.
Upon activation of this command, you are presented with a dialog box where you can type the string to be searched. You can also indicate the number of mismatches allowed in the search. If this number is zero, an exact search will be performed. With a number of mismatches greater than zero, you can for instance look for the possible annealing positions of a mutagenic oligonucleotide in your DNA sequence. The search starts at the current cursor position in the active sequence. If the search string is found, the matching region in the sequence will be selected; otherwise, DNAid+ will simply beep.

Find Same

This function repeats the last search with the string and number of mismatches defined in Find With Errors.

APPENDIX

IUPAC/IUB codes for ambiguous nucleotides used by DNAid+

A  Adenosine
T  Thymidine
G  Guanosine
C  Cytosine

W = A or T          S = G or C
Y = C or T          R = A or G
M = A or C          K = G or T
B = T, G or C       V = A, G or C
H = A, T or C       D = A, T or G
N = A, T, G or C    blank = none
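
The kind of search performed by Find With Errors can be pictured as a window sliding along the sequence while mismatches are counted. The Python sketch below is only an illustration of that idea and is not the routine used inside DNAid+: the function name find_with_errors, its arguments and its 0-based positions are hypothetical. It also assumes, for convenience, that ambiguity codes from the table above may appear in the search string; this manual does not say whether Find With Errors accepts them.

```python
# Illustrative sketch only: not the routine used by DNAid+.
# IUPAC/IUB ambiguity codes, as listed in the appendix table.
IUPAC = {
    "A": "A", "T": "T", "G": "G", "C": "C",
    "W": "AT", "S": "GC", "Y": "CT", "R": "AG",
    "M": "AC", "K": "GT",
    "B": "TGC", "V": "AGC", "H": "ATC", "D": "ATG",
    "N": "ATGC",
}


def find_with_errors(sequence, query, max_mismatches=0, start=0):
    """Return the 0-based position of the first window of `sequence`,
    at or after `start`, that differs from `query` in at most
    `max_mismatches` positions; return -1 if there is none."""
    sequence = sequence.upper()
    query = query.upper()
    width = len(query)
    if width == 0:
        return -1
    for pos in range(max(start, 0), len(sequence) - width + 1):
        mismatches = 0
        for offset, code in enumerate(query):
            # Assumption: a query letter matches any base it can stand for.
            if sequence[pos + offset] not in IUPAC.get(code, code):
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many errors, try the next window
        else:
            # The inner loop finished without exceeding the limit.
            return pos
    return -1
```

With the sequence GGCTATAATGCC, searching for TATCAT with zero mismatches returns -1, while allowing one mismatch returns position 3, where TATAAT differs from the search string by a single base. Passing the position just after a hit as start gives the behaviour of a repeated search, in the spirit of Find Same.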